Abstract. Probabilistic Inductive Logic Programming (PILP) is a relatively
unexplored area of Statistical Relational Learning which extends
classic Inductive Logic Programming (ILP). This work introduces SkILL,
a Stochastic Inductive Logic Learner, which takes probabilistic annotated
data and produces First Order Logic theories. Data in several domains
such as medicine and bioinformatics have an inherent degree of uncertainty
that can be used to produce models closer to reality. SkILL can
not only use this type of probabilistic data to extract non-trivial knowledge
from databases, but it also addresses efficiency issues by introducing
a novel, efficient and effective search strategy to guide the search in PILP
environments. The capabilities of SkILL are demonstrated in three different
datasets: (i) a synthetic toy example used to validate the system,
(ii) a probabilistic adaptation of a well-known biological metabolism
application, and (iii) a real world medical dataset in the breast cancer
domain. Results show that SkILL can perform as well as a deterministic
ILP learner, while also being able to incorporate probabilistic knowledge
that would otherwise not be considered.


1 Introduction 

Statistical Relational Learning (SRL) is a well-known collection of techniques
whose main objective is to produce interpretable probabilistic classifiers,
often in the form of readable logical sentences. While researchers have spent their
efforts on creating logic languages to represent probabilities and runtime environments
that can deal with them |2:il22l2lf)l4li^, few works have been dedicated
to learning rules from probabilistic knowledge. In this work, we introduce SkILL,
a Stochastic Inductive Logic Learner, which can combine the rule learning
capability of classic Inductive Logic Programming (ILP) |IIII6) with uncertain
knowledge as probabilistic annotated data to produce First Order Logic (FOL)
theories.

ILP is a machine learning branch which stands out due to its suitability to 
handle relational data. ILP’s main goal is to construct a theory which explains 
a set of observations (called examples), given a set of facts and/or rules which 
are of a relational nature (called background knowledge). The induced theory 
can then be used for prediction (as it can output probability values for a given 
example) as well as classification (as it can also output the specific categorical 
label for an example). Probabilistic Inductive Logic Programming (PILP) [5D]
extends discrete ILP by considering background knowledge and/or examples
that are annotated with probabilities. This is a natural extension of ILP and 
can in fact model different semantic scenarios, according to the meaning that is 
assigned to the probabilities. 

Using probabilities to describe data has the potential advantage of greatly 
reducing the dataset size, since useful information can still be extracted from 
marginal distributions. Also, in cases where the full conditional probability table 
is not known, information can still be used efficiently in the computation of 
a rule, for instance, by adding values from the literature in this form to the 
background knowledge. Compressing data in such a way could also be used in 
order to protect private sensitive data. There are surely several other scenarios 
in which probabilities can be applied and taken advantage of. Throughout this 
work, probabilities will be used as marginal distributions (motivational example), 
as a transformation of a numeric attribute in discrete data (metabolism dataset), 
and as an empirical confidence (non-definitive biopsies dataset). 

SkILL can not only use all these types of probabilistic data but it also addresses
efficiency issues by introducing a novel, efficient and effective search strategy
to guide the search for FOL theories. SkILL runs on top of the Yap Prolog
system [1], uses GILPS [17] as the basis rule generator and MetaProbLog mm
(an extension of ProbLog |2I8) 1 as the probabilistic representation language.
Knowledge is thus annotated according to ProbLog syntax and the MetaProbLog
engine is used to evaluate the probabilities of the generated theories.

The remainder of this paper is organized as follows. First, a toy example 
is introduced to motivate the transition between ILP and PILP, followed by a 
description of related work. Next, we present the SkILL system and focus on 
some efficiency issues. Then, two experiments are performed to assess SkILL’s 
performance, followed by a discussion of results and the conclusion. 


2 Motivational Example 

Rock-paper-scissors is a game where two players each play one of the three 
objects - either rock, paper or scissors - simultaneously, through movements of 
their hands, and the winner is chosen based on the rules presented in Fig. 1
(which use the Prolog syntax). 


beats(Round, PlayerA, PlayerB) :-
    plays(Round, PlayerA, rock), plays(Round, PlayerB, scissors).
beats(Round, PlayerA, PlayerB) :-
    plays(Round, PlayerA, paper), plays(Round, PlayerB, rock).
beats(Round, PlayerA, PlayerB) :-
    plays(Round, PlayerA, scissors), plays(Round, PlayerB, paper).


Fig. 1. Rules of the rock-paper-scissors game in Prolog syntax 






If data of this game were recorded, it would contain players’ choices of objects
for each round as well as the result of each round. This is illustrated in Fig. 2,
where the first argument of plays/3 represents the round (consecutive integers), the second
argument is the player (playerA, playerB or playerC), and the third argument
corresponds to the object chosen by that player (rock, paper or scissors). Predicate beats/3
represents, for each round (first argument), which player is the winner (second
argument) and which one is the loser (third argument).


plays(1,playerA,paper).       plays(2,playerB,rock).
plays(1,playerB,scissors).    plays(2,playerC,scissors).
beats(1,playerB,playerA).     beats(2,playerB,playerC).


Fig. 2. Example of full description of game 


Traditional ILP can be used in this problem as is to induce the rules of the 
game. This formulation of the problem is trivial for an ILP engine and it can 
take as few as three examples to learn the three rules of the game. 

SkILL allows for inducing the same set of rules from different background 
knowledge (BK) information. Suppose the information about each round was not 
available and that all available information was the profile/strategy of a given 
player (how often does he/she play each object) and how often did that player 
win against other players. This setting carries much less information because 
nothing is known about the sequence of games or against whom a player played; 
only the marginal distributions are known. Figure 3 presents an example of this
new form of BK, where semi-colon has the meaning of an exclusive-or connective 
(different from the Prolog syntax). 


0.1::plays(playerA,rock);
0.1::plays(playerA,paper);
0.8::plays(playerA,scissors).

0.1::plays(playerB,rock);
0.3::plays(playerB,paper);
0.6::plays(playerB,scissors).

0.4::beats(playerA,playerB).


Fig. 3. Probabilistic BK of rock-paper-scissors game 


In Fig. 3, facts are annotated according to Halpern’s type I probability structure
[5], where numbers on the left correspond to values of the game domain,
which can be interpreted as the frequency with which each event happens.
Predicates plays and beats now have only 2 arguments because the
rounds are no longer relevant to the problem.
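
To make the link between Fig. 1 and Fig. 3 concrete, the following minimal Python sketch computes the probability that the rules of Fig. 1 assign to beats(playerA, playerB) from the marginal profiles of Fig. 3. It assumes that the two players choose their objects independently, which the text does not state; the names PROFILE_A, PROFILE_B and predicted_win_probability are illustrative only and are not part of SkILL.

```python
# Marginal profiles copied from Fig. 3: how often each player picks each object.
PROFILE_A = {"rock": 0.1, "paper": 0.1, "scissors": 0.8}
PROFILE_B = {"rock": 0.1, "paper": 0.3, "scissors": 0.6}

# The three rules of Fig. 1 as (winning object, losing object) pairs.
WINS_AGAINST = [("rock", "scissors"), ("paper", "rock"), ("scissors", "paper")]


def predicted_win_probability(winner_profile, loser_profile):
    """P(winner beats loser) under the (assumed) independence of the two players.

    The three rules are mutually exclusive because a player picks exactly one
    object, so their probabilities can simply be added.
    """
    return sum(winner_profile[win] * loser_profile[lose] for win, lose in WINS_AGAINST)


predicted = predicted_win_probability(PROFILE_A, PROFILE_B)
observed = 0.4  # annotation of beats(playerA, playerB) in Fig. 3
difference = predicted - observed
```

Under these assumptions the rules predict a win frequency of about 0.31 for playerA against playerB, while the annotated frequency is 0.4. The gap between a predicted and an annotated value is exactly the kind of quantity a PILP learner tries to keep small over all examples.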

Experiments were made by annotating simulated games based on random
player profiles, and SkILL induced the rules presented in Fig. 1 from information
about the profiles of players using as few as 10 observations and three players.










3 SkILL 


SkILL is a tool which can extract non-trivial knowledge (FOL theories) from 
probabilistic data. As is the case of ILP systems, SkILL’s setting includes three 
main components: 

Probabilistic Background Knowledge (PBK) represents the basic information
known about the problem and can be composed of both rules and
facts, either probabilistic or not.

Probabilistic Examples (PE) represent the observations the system is attempting
to explain. In the classical ILP setting there can be positive and
negative examples, but in the probabilistic setting that information must be
encoded as probabilities. These probabilities are the expected values of examples
and can represent either statistical information or the degree of belief in
an example (using type I or type II probability structures [5], respectively).

Search Space Constraints are mode declarations used to guide the search, whose
aim is to minimize a loss function.

Since search spaces are often too large, a common approach is to guide the 
search by using strategies that can lead to good hypotheses without exhaustively 
traversing all the search space. SkILL introduces a novel, efficient and effective 
search strategy to guide the search in PILP environments. 


3.1 Traversing the Search Space 

Algorithm 1 presents SkILL’s main algorithm. The algorithm takes as input the
probabilistic background knowledge (PBK) and a set of examples (PE) plus parameters
corresponding to the maximum length of a theory (or set of hypotheses)
to be generated (MaxTheoryLength), the number of hypotheses to be combined
in order to limit the search space (Psize and Ssize), a metric to rank the selection
of hypotheses to be combined (RankMetric), and a final metric that is used to
decide which is the best theory found (EvalMetric).


Algorithm 1: SkILL Algorithm 


 1 Input = PBK, PE, MaxTheoryLength, Psize, Ssize, RankMetric, EvalMetric
 2 Output = Best theory according to EvalMetric
 3 Hyps1 = HypsN = AllHyps = generate_hypotheses_length_one(PBK, PE)
 4 for Length = 2; Length < MaxTheoryLength; Length++ do
 5     Primary = select_primary_set(HypsN, Psize, RankMetric)
 6     Secondary = select_secondary_set(Hyps1, Ssize, RankMetric)
 7     HypsN = generate_combinations(Primary, Secondary)
 8     AllHyps = AllHyps ∪ HypsN
 9 end
10 return best_theory(AllHyps, EvalMetric)






Initially, the algorithm uses the TopLog engine from the GILPS [17] ILP
system to generate all possible hypotheses composed of only one clause (line 3
in Alg. 1). A top level generic hypothesis is constructed from the mode declarations
in the PBK and possible hypotheses are generated independently from
each example using SLD refutation. This approach ensures that each hypothesis
generated must be entailed by at least one example, and so the hypotheses mirror
patterns contained in the observations with respect to the PBK. SkILL improves
on this approach by removing hypotheses which are permutations of each other
(i.e., syntactically distinct but semantically equal), so that probabilistic inference
is only performed over semantically unique hypotheses.

Once hypotheses with length one are generated, the algorithm proceeds by
generating hypotheses with length greater than one (lines 4-9 in Alg. 1) until
reaching a given maximum theory size (argument MaxTheoryLength). Combining
hypotheses in order to generate new hypotheses with larger size is not a trivial
task: the number of possible combinations is $\binom{N}{K}$, with N being the total
number of length one hypotheses and K the maximum theory size. Ideally, an exhaustive search
of the hypotheses space would be performed, but this is computationally taxing,
particularly as the theory size grows. Therefore, SkILL’s search strategy selects
candidate hypotheses for two different sets, named Primary and Secondary, and
new hypotheses are then generated by only combining members of these sets.
To do so, for each theory length, the algorithm first selects the Primary and
Secondary sets of hypotheses, with sizes equal to arguments Psize and Ssize,
respectively (lines 5-6 in Alg. 1), and then it performs the combinations (line 7
in Alg. 1). This procedure repeats until hypotheses have been generated for all lengths.

SkILL’s selection procedure has two main goals: (i) reduce the number of 
combinations to be generated without losing the good hypotheses in the process 
and (ii) introduce some stochastic behavior by giving identical opportunity to 
weaker rules whose combination can be of interest. The primary and secondary 
sets can be seen as a way to materialize these two goals, respectively. 

The primary set of hypotheses is considered to be the most relevant, i.e., 
the one holding the best set of hypotheses according to a given ranking metric 
(argument RankMetric). In each iteration of the algorithm, the primary set is 
filled with the Psize best hypotheses from the set of hypotheses generated in the 
previous iteration (1 clause hypotheses when searching for 2 clauses hypotheses; 
2 clauses hypotheses when searching for 3 clauses hypotheses; etc). To rank 
hypotheses, SkILL supports three metrics: RMSE (root mean square error), PAcc 
(probabilistic accuracy) and Random. 

The secondary set is filled with Ssize hypotheses from the set of hypotheses 
with length one. The aim of the secondary set is to include very different candidate
hypotheses whose combination with the hypotheses from the primary set
can be of interest. Priority to full stochastic behaviour can be given by randomly 
selecting all the hypotheses for the secondary set, or a selection based on best set 
of hypotheses according to the given ranking metric can be made. Additionally, 
both approaches can be combined in order to obtain a more heterogeneous set. 


In particular, the experimental results presented used a mixed scenario where 
the secondary set always includes the Psize best hypotheses with one clause (i.e., 
the hypotheses selected for the first primary set) plus (Ssize - Psize) randomly 
selected distinct candidates from the remaining hypotheses with one clause. This 
stochastic component of the selection is distinct for each iteration. 

Finally, according to a given evaluation metric (argument EvalMetric), the 
best generated hypothesis for all different lengths is returned (line 10 in Alg. 1).
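
The selection and combination steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not SkILL's implementation: a theory is modelled as a frozenset of clause identifiers, a disjunctive combination is the union of two such sets, a single `score` function (lower is better) stands in for both RankMetric and EvalMetric, and the loop bound is taken as inclusive. The names skill_search, score, psize and ssize are hypothetical.

```python
import random


def skill_search(hyps1, score, max_len, psize, ssize, rng=None):
    """Return the best theory found; `score` maps a theory to a value (lower is better)."""
    rng = rng or random.Random(0)
    hyps1 = sorted(set(hyps1), key=score)          # length-one theories, best first
    hyps_n, all_hyps = list(hyps1), list(hyps1)
    for _ in range(2, max_len + 1):                # assumption: inclusive upper bound
        # Primary set: the psize best theories produced by the previous iteration.
        primary = sorted(hyps_n, key=score)[:psize]
        # Mixed secondary set: psize best length-one theories plus random others,
        # redrawn at every iteration.
        best, rest = hyps1[:psize], hyps1[psize:]
        extra = rng.sample(rest, min(max(ssize - psize, 0), len(rest)))
        secondary = best + extra
        # Combine each primary theory with each secondary clause, skipping
        # combinations that would not add a new clause.
        hyps_n = list({p | s for p in primary for s in secondary if not s <= p})
        all_hyps.extend(hyps_n)
    return min(all_hyps, key=score)


# Toy usage: the "ideal" theory is {a, b}; the score counts mismatching clauses.
target = frozenset({"a", "b"})
toy_hyps = [frozenset({c}) for c in "abcd"]
best = skill_search(toy_hyps, lambda t: len(t ^ target), max_len=2, psize=2, ssize=3)
```

In the toy run the two best single-clause theories end up in both sets, so their union is generated and returned. Pruning (Section 3.3) and probabilistic inference are deliberately left out of this sketch.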


3.2 Evaluation Metrics 

Currently, SkILL implements the RMSE and PAcc metrics. These metrics can 
be used to rank and/or evaluate hypotheses, as mentioned earlier. Since, from 
the point of view of SkILL’s algorithm, the ranking and evaluation phases are 
independent, we have chosen to introduce two different metric arguments instead 
of only one. By doing this, we not only highlight that independence but also do 
not restrict possibly different metric combinations. 

The RMSE metric penalizes predictions farther from the expected values, 
while PAcc is the generalization of the discrete accuracy to the probabilistic 
setting as introduced by De Raedt and Thon [21] and used by Muggleton [15].

The RMSE of a hypothesis H can be defined as:

$$RMSE_H = \sqrt{\frac{1}{|PE|} \sum_{e_i \in PE} \left( P_H(e_i) - P(e_i) \right)^2} \qquad (1)$$

where $P_H(e_i)$ denotes the probability that H together with the PBK entails
an example $e_i$, and $P(e_i)$ denotes the given expected value of example $e_i$.

The PAcc of a hypothesis H is often represented in terms of true positive
(TP), true negative (TN), false positive (FP) and false negative (FN) examples,
as shown in Equation 2.

$$PAcc_H = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)$$


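From [21], $TP + TN + FP + FN = |PE|$, and TP and TN are equal to the sum
over all examples of $\min(P_H(e_i), P(e_i))$ and $\min(1 - P_H(e_i), 1 - P(e_i))$,
respectively. Substituting into Equation 2, this gives that PAcc can also be represented
in terms of the average absolute error between predictions and expected values,
as shown in Equation 3.

$$
\begin{aligned}
PAcc_H &= \frac{1}{|PE|} \sum_{e_i \in PE} \left( \min(P_H(e_i), P(e_i)) + \min(1 - P_H(e_i), 1 - P(e_i)) \right) \\
       &= \frac{1}{|PE|} \sum_{e_i \in PE} \left( \min(P_H(e_i), P(e_i)) + 1 - \max(P_H(e_i), P(e_i)) \right) \\
       &= \frac{1}{|PE|} \sum_{e_i \in PE} \left( 1 - |P_H(e_i) - P(e_i)| \right) \\
       &= 1 - \frac{\sum_{e_i \in PE} |P_H(e_i) - P(e_i)|}{|PE|}
\end{aligned}
\qquad (3)
$$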








As presented in Equation 4, both metrics can also be defined based on a common
loss function $loss_H(e_i) = P_H(e_i) - P(e_i)$, which calculates the difference
between the probabilistic expected value of an example and the value that can
be predicted w.r.t. a given hypothesis and the PBK.

$$
\begin{aligned}
RMSE_H &= \sqrt{\frac{1}{|PE|} \sum_{e_i \in PE} loss_H(e_i)^2} \\
PAcc_H &= 1 - \frac{1}{|PE|} \sum_{e_i \in PE} |loss_H(e_i)|
\end{aligned}
\qquad (4)
$$

Hence, the aim of SkILL’s search engine is to find the hypothesis with minimum 
RMSE or maximum PAcc in the search space. 
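
The two metrics and the equivalence derived in Equation 3 can be checked numerically with a short Python sketch. The function names (loss, rmse, pacc, pacc_from_counts) and the three example values are illustrative assumptions and do not come from SkILL or from the datasets used in this paper.

```python
import math


def loss(pred, expected):
    """loss_H(e_i) = P_H(e_i) - P(e_i)."""
    return pred - expected


def rmse(preds, expected):
    """Root mean square error over all examples (Equation 1)."""
    return math.sqrt(sum(loss(p, e) ** 2 for p, e in zip(preds, expected)) / len(preds))


def pacc(preds, expected):
    """Probabilistic accuracy as one minus the mean absolute loss (Equation 3)."""
    return 1.0 - sum(abs(loss(p, e)) for p, e in zip(preds, expected)) / len(preds)


def pacc_from_counts(preds, expected):
    """Probabilistic accuracy from probabilistic TP and TN counts (Equation 2)."""
    tp = sum(min(p, e) for p, e in zip(preds, expected))
    tn = sum(min(1 - p, 1 - e) for p, e in zip(preds, expected))
    return (tp + tn) / len(preds)


# Made-up predictions P_H(e_i) and expected values P(e_i) for three examples.
preds = [0.9, 0.2, 0.6]
expected = [1.0, 0.0, 0.5]
```

For these values both formulations of PAcc agree (about 0.867), and the RMSE is the square root of 0.02. Because RMSE squares each loss, the single larger error of 0.2 weighs more in RMSE than it does in PAcc.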

3.3 Pruning Combinations 

PILP shares a similar hypothesis search space as an ILP problem where all 
examples are expected to be true. The difference between the approaches lies 
obviously in the evaluation of hypotheses; in the first case it is a number between 
0.0 and 1.0 representing a probability, whilst in the latter it is either true or false. 

Theories in ILP are constructed by combining several hypotheses through
logic conjunction ($\land$) and disjunction ($\lor$). Let $H_1, H_2$ be hypotheses; the
hypothesis resulting from conjoining $H_1$ with $H_2$ is more specific than either $H_1$
or $H_2$, while the disjunction of $H_1$ and $H_2$ is more general than either $H_1$ or
$H_2$. Equation 5 shows how an example could be entailed by a disjunction in
terms of two hypotheses $H_1$ and $H_2$.

$$
H_1 \lor H_2 \models e_i \;\equiv\;
(H_1 \land \lnot H_2 \models e_i) \text{ OR }
(\lnot H_1 \land H_2 \models e_i) \text{ OR }
(H_1 \land H_2 \models e_i)
\qquad (5)
$$

Equation 5 can be extended to the probabilistic case using the principle of
inclusion/exclusion as follows. In the probabilistic scenario, the probability of a
hypothesis $P_H$ represents the probabilistic mass covered by $PBK \cup H \models true$,
which can range between 0 and 1. As such, the probability of a disjunction of
hypotheses can be calculated according to the expression shown in Equation 6.

$$P_{H_1 \lor H_2}(e_i) = P_{H_1}(e_i) + P_{H_2}(e_i) - P_{H_1 \land H_2}(e_i) \qquad (6)$$

The $P_{H_1 \land H_2}(e_i)$ term of Equation 6 is the probability that both hypotheses
(conjunction) entail the example. From set theory, three particular cases are known,
namely completely overlapping, independent or disjoint masses, and these can
be calculated according to the expressions in Table 1.


Table 1. Special cases of $P_{H_1 \lor H_2}(e_i)$ from set theory

Case                      $P_{H_1 \lor H_2}(e_i)$
Completely Overlapping    $\max(P_{H_1}(e_i), P_{H_2}(e_i))$
Independent               $P_{H_1}(e_i) + (1 - P_{H_1}(e_i)) \, P_{H_2}(e_i)$
Disjoint                  $P_{H_1}(e_i) + P_{H_2}(e_i)$


By analysing Table 1 it becomes evident that the probability of the disjunction
of two hypotheses has clear minimum and maximum boundaries, derived from the
cases of completely overlapping and disjoint masses, respectively.

$$P_{H_1 \lor H_2}(e_i) \in \left[ \max(P_{H_1}(e_i), P_{H_2}(e_i)),\; \min(P_{H_1}(e_i) + P_{H_2}(e_i), 1.0) \right] \qquad (7)$$

The boundaries stated in Equation 7 make it possible to prune combinations
of hypotheses whose interval of results does not contain the expected value for
that example. SkILL can use these boundaries to prune combinations in two
different contexts: (a) before probabilistic inference, to avoid performing such
computations on some combinations, and (b) after probabilistic inference, to
remove combinations found to be bad after inference. This is part of the
generate_combinations() function as illustrated in Algorithm 2. Functions
possibly_good_combination() and good_combinations() at lines 8 and 15 in Alg. 2
implement each pruning strategy, respectively.
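
The interval of Equation 7 is cheap to compute from the probabilities of the two hypotheses alone, without any probabilistic inference. The Python sketch below shows the bounds and a per-example containment check; it illustrates the boundaries only and is not SkILL's actual pruning code, and the function names are hypothetical.

```python
def disjunction_bounds(p1, p2):
    """Lower and upper bound of P_{H1 v H2}(e_i) for one example (Equation 7)."""
    return max(p1, p2), min(p1 + p2, 1.0)


def independent_disjunction(p1, p2):
    """Value of the disjunction when the two masses are independent (Table 1)."""
    return p1 + (1.0 - p1) * p2


def interval_contains(p1, p2, expected):
    """True if the expected value of the example can be reached by H1 v H2."""
    low, high = disjunction_bounds(p1, p2)
    return low <= expected <= high


# Made-up probabilities of two hypotheses on a single example.
low, high = disjunction_bounds(0.3, 0.5)
middle = independent_disjunction(0.3, 0.5)
```

With probabilities 0.3 and 0.5 the disjunction lies between 0.5 and 0.8, and the independent case (0.65) falls inside that interval. An example whose expected value is 0.9 cannot be matched by this combination, whereas an expected value of 0.6 can.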


Algorithm 2: Function generate_combinations()


 1 Input = Primary and Secondary sets
 2 Output = Combination of (not pruned) hypotheses from both sets
 3 begin
 4     HypsN = {}
 5     foreach Hp in Primary do
 6         foreach Hs in Secondary do
 7             Hnew = Hp ∨ Hs
 8             if possibly_good_combination(Hnew) then
 9                 H(new,prob) = do_problog_inference(Hnew)
10                 HypsN = HypsN ∪ H(new,prob)
11             end
12         end
13     end
14 end
15 return good_combinations(HypsN)


Finding the best pruning strategies to use these boundaries is not straightforward,
since we must take into account the predictions of a hypothesis for all examples.
Because data will most likely not be completely independent or completely
mutually exclusive, the strategies must consider that the contribution of a
hypothesis in a combination of two hypotheses varies greatly, and so care must be
taken not to prune away rules which might have been important. This concept
is better illustrated in Fig. 4.

Figure 4(a) shows the case of combination of hypotheses, where the shaded
area represents the possible contribution of the disjunction of hypotheses and
the blue points are the estimated values $P_{H_1 \lor H_2}(e_i)$ for the disjunction, which in
the case of our function possibly_good_combination() are the values in the center
of that interval.

(b) A good ($H_1$) and bad ($H_2$) hypothesis


Fig. 4. Using the boundaries


Since hypotheses are being combined using disjunctions, the value of the
combination for one particular example can only be greater than or equal to
the value of any of the hypotheses in the combination. As such, combinations
of hypotheses whose result is lower than the expected values are in principle
of greater interest for combination than others. Figure 4(b) shows the case of a
good and a bad hypothesis according to this principle. SkILL’s pruning functions
possibly_good_combination() and good_combinations() reflect this by discarding
the combinations $H_{new}$ whose estimated contribution is overall less than the
expected values $P(e_i)$, as shown by Equation 8.

$$\sum_{e_i \in PE} \left( P(e_i) - P_{H_{new}}(e_i) \right) > 0 \qquad (8)$$


4 Experimental Settings 

The foremost focus of SkILL is the discovery of non-trivial knowledge from a 
dataset in order to explain observations. The quality of the knowledge discovered 
is currently evaluated by two different metrics (probabilistic accuracy or RMSE). 

Furthermore, the FOL theories found by SkILL can also be used in classification
by introducing a threshold. The threshold could be learned from the
original observations or be arbitrarily chosen. This approach has a benefit over
classical ILP (such as Aleph) in its capability to cope with noise in the data.
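
As an illustration of the thresholding step, the following Python sketch turns predicted probabilities into class labels and picks, among the observed probabilities, the threshold with the highest training accuracy. The text does not say how a threshold would be learned, so this selection rule, the default of 0.5 and the function names are all assumptions made for the example.

```python
def classify(probability, threshold=0.5):
    """Label an example positive when its predicted probability reaches the threshold."""
    return probability >= threshold


def learn_threshold(probabilities, labels):
    """Pick the candidate threshold with the highest accuracy on the given observations.

    Candidates are the distinct predicted probabilities themselves; this is one
    simple choice among many and is not taken from the SkILL system.
    """
    def accuracy(threshold):
        hits = sum(classify(p, threshold) == y for p, y in zip(probabilities, labels))
        return hits / len(labels)

    return max(sorted(set(probabilities)), key=accuracy)


# Made-up predictions of a theory and the corresponding discrete labels.
train_probs = [0.9, 0.7, 0.4, 0.2]
train_labels = [True, True, False, False]
threshold = learn_threshold(train_probs, train_labels)
```

On this toy data the learned threshold is 0.7, which separates the two positive examples from the two negative ones.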

As such, this work presents experiments of both types: the metabolism dataset 
is used to evaluate the classification accuracy of the system, and a medical 
dataset of non-definite biopsies is used as the basis for extraction of non-trivial 
knowledge in this domain. 

Accuracy is used to evaluate the classifiers, using the standard formula in the
discrete case (Aleph) and its probabilistic extension as presented in Section 3
for the probabilistic case (SkILL).










4.1 Classification 


The dataset used to assess SkILL’s classification accuracy is the metabolism
dataset, taken from the 2001 KDD Cup Challenge². Although the challenge
involved learning 14 different protein functions, this experiment focuses on
a subtask, which is to predict which proteins are responsible for metabolism. For
this purpose, we use a subset of the full dataset containing 230 examples split
evenly between positives and negatives. Since the dataset is originally discrete, a
normalization was applied to the interaction(gene1, gene2, type, strength) fact in the BK:
interaction’s fourth argument is a numerical argument which represents
the strength of the interaction between two genes. By transforming
interaction(gene1, gene2, type, strength) to strength_norm::interaction(gene1, gene2, type), not
only is the search space of hypotheses reduced (because that feature is no longer
directly considered in the hypotheses generation process), but predicates
typically used to compare numerical features in ILP are also made redundant in this
case, since SkILL implicitly attempts to find the hypotheses with the best fit
to the examples, taking into account the probabilities of the facts in the PBK.
Finally, we converted the examples from discrete true/false to probabilistic with
1.0/0.0 probabilities, respectively.
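
The transformation of interaction facts can be sketched in Python as below. The text does not specify how strength_norm is obtained, so min-max scaling to the interval [0, 1] is an assumption made purely for illustration, as are the function name and the gene identifiers in the usage example.

```python
def to_probabilistic_facts(interactions):
    """Turn (gene1, gene2, type, strength) tuples into ProbLog-style annotated facts.

    Assumption: strengths are rescaled with min-max normalisation; the paper
    does not state which normalisation was actually used.
    """
    strengths = [strength for *_, strength in interactions]
    low, high = min(strengths), max(strengths)
    span = (high - low) or 1.0  # avoid division by zero when all strengths are equal
    facts = []
    for gene1, gene2, kind, strength in interactions:
        probability = (strength - low) / span
        facts.append(f"{probability:.2f}::interaction({gene1},{gene2},{kind}).")
    return facts


# Hypothetical rows; the identifiers are placeholders, not entries of the dataset.
rows = [
    ("g1", "g2", "genetic", 0.0),
    ("g1", "g3", "physical", 0.5),
    ("g2", "g3", "physical", 1.0),
]
facts = to_probabilistic_facts(rows)
```

Each resulting line follows the ProbLog annotation syntax used in Fig. 3, with the numeric strength moved out of the argument list and into the probability label.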

Metabolism is a fairly small dataset: it is composed of 230 examples (half
positive and half negative) and approximately 7000 BK facts, of which 3200
are probabilistic. As such, 30 bootstraps with a 70-30 split were generated and the results
presented for all experiments are the average and standard deviation over the
test sets of the 30 bootstraps (70% of the cases of each bootstrap were used for training and 30%
for testing). Since SkILL provides several configuration options, various scenarios
were tested in order to compare the results among them. Results presented for the
Aleph system [21] were collected with the default parameters (except noise, which
is set to maximum). However, since the BK of metabolism has been altered, the
systems are not working with comparable data, and so these values are meant
to be merely informative.

Table 2 presents a comparison between using a pruning strategy and exhaustively
combining Primary (20 hypotheses) against Secondary (200 hypotheses),
for hypotheses up to size 3. The number of hypotheses of size 1 of the training
sets ranges from 2000 to 3000, and so Secondary represents about 10% of
hypotheses, while Primary represents 1%. This table also presents discrete ILP
results using Aleph’s default configuration and allowing for maximum noise.

Table 2. Accuracy of the models on the test set for the metabolism dataset

                      Search Strategies
System                (RMSE, σ)         (PAcc, σ)
SkILL                 (0.616, 0.063)    (0.661, 0.045)
SkILL+pruning         (0.581, 0.099)    (0.663, 0.045)
Aleph                          (0.656, 0.047)


2 http://www.cs.wisc.edu/~dpage/kddcup2001







The results in Table 2 show that driving the search with the PAcc metric (both
for evaluation and ranking) produces better classification results (2-tailed t-test,
p = 0.04) than both the discrete case and the use of RMSE as a ranking metric.
We believe that penalizing greater distances from the expected values (as when
using RMSE) produces worse results accuracy-wise because it overfits the
training dataset.

Table 3 studies the effect of varying the sizes of Primary and Secondary,
and how their accuracy and RMSE relate to the sizes of these sets for different 
ranking metrics. 


Table 3. Probabilistic Accuracy of the test set for varying sizes of Primary and
Secondary sets and different search strategies using SkILL with pruning

                                       Search Strategies
Psize/Ssize    (RMSE-Rand, σ)    (RMSE-RMSE, σ)    (PAcc-Rand, σ)    (PAcc-PAcc, σ)
10/100         (0.586, 0.093)    (0.583, 0.095)    (0.663, 0.045)    (0.663, 0.045)
20/200         (0.583, 0.104)    (0.581, 0.099)    (0.663, 0.045)    (0.663, 0.045)
30/300         (0.575, 0.096)    (0.612, 0.065)    (0.663, 0.045)    (0.663, 0.045)


From Table 3 it becomes evident that all PAcc measurements are the same:
this is because the best classifier for this dataset is always a hypothesis of
length one, and therefore is always considered independently of the population 
and the ranking metric. None of the candidate hypotheses of length greater 
than one results in a better accuracy for this evaluation metric. However, when 
using the RMSE evaluation metric, many different hypotheses are generated, for 
different training datasets. Again, this substantiates the notion that the RMSE 
evaluation metric may be causing overfitting. These results also indicate that 
the difference between using a random or RMSE ranking criterion is negligible 
for small population sizes (2-tailed t-test with p=0.47 and p=0.89 for 10/100
and 20/200, respectively). This happens because the best hypotheses ranked 
by RMSE are not good candidates for combination in this case, so the random 
hypotheses are in fact being used in most cases. In the case of 30/300 population, 
the RMSE ranking shows an improvement in the results, but at the cost of 
longer runtime. A random ranking strategy does not require that the population 
be ordered, and when the size of generated hypotheses grows, so does the time 
spent in ordering them. 

4.2 Knowledge extraction 

Breast cancer diagnosis guidelines suggest that patients presenting suspicious 
breast lesions should be sent to perform a diagnostic mammogram and possibly 
an ultrasound, and a core needle biopsy to further define this abnormality. The 
biopsy is very important in determining malignancy of a lesion and usually 
yields definitive results; however, in 5% to 15% of cases, the results are
non-definitive [19]. Routine practice usually sends all patients with non-definitive
biopsies to excision, even though only a small fraction of them (10-20%) have in
fact a malignant finding confirmed after the procedure; the remainder of them
did not need to be subjected to surgery. In the US this represents approximately 
35,000 to 105,000 women who likely underwent excision and a majority of them 
ultimately received a benign diagnosis. 

Although non-definitive biopsies are relatively rare, sending every woman 
that has a non-definitive biopsy to excision is not a good practice. Machine 
learning methods have been used to mitigate this and other problems by allowing 
to produce models of the data that can distinguish between benign and malignant 
cases mm- However, in the medical domain it is crucial to represent data in 
a way that experts can understand and reason about, and as such ILP can 
successfully be used to produce such models. Furthermore, probabilistic ILP 
allows for incorporating in the PBK the confidence of physicians in observations 
and known values from the literature. 

In this study, we use 130 biopsies dating from January 2006 to December
2011, which were prospectively given a non-definitive diagnosis at radiologic-
histologic correlation conferences. 21 cases were determined to be malignant after
surgery, and the remaining 109 proved to be benign. For all of these cases, several
sources of variables were systematically collected, including variables related
to demographic and historical patient information (age, personal history, family
history etc), mammographic BI-RADS descriptors (mass shape, mass margins,
calcifications etc), pathological information after biopsy (type of disease, if it
is incidental or not, number of foci etc), biopsy procedure information (needle
gauge, type of procedure etc), and other relevant facts about the patient.
Probabilistic data was also gathered, namely the confidence in malignancy for each
case (before excision), assigned by different physicians analysing that case.
Furthermore, since physicians base their conclusions on literature values from
the universe of all biopsies, values were added to the PBK as the probability of
malignancy given a feature value (is_malignant features). For example, it is well
known among radiologists expert in mammography that if a mass has a spiculated
margin, the probability that the associated finding is malignant is around
90%.

Two kinds of experiments were performed on this dataset: (i) the malignancy
experiment consisted of finding theories by using as examples a discrete class
variable malignancy determined after excision (either malignant or not), and
(ii) the malignancyPH experiments used as examples the probabilities assigned by
different physicians (PH1, PH2, PH3) to the malignancy of each case. The resulting
theories are presented in Figure 5. These experiments were performed
on the full training set, since they were intended to be exploratory. For each
classifier, we report accuracy on the full training set only to illustrate differences
between the different classifiers. Figure 5 shows the best hypotheses found using:
PAcc metric both for ranking and evaluation; primary/secondary population of 
20/200; and generating hypotheses until length 3. The malignancy predicate is 
the best classifier found for malignancy of a tumour (experiment (i), accuracy 
= 88%). Probabilistic BK did not play an important role in this task, since the 
class variable is deterministic. Nevertheless, SkILL managed to find a good rule 



malignancy(Patient) :-
    distrib_Grp(Patient, missing),
    aDH(Patient, 'Y'),
    domiFinding(Patient, mass).

malignancyPH1(Patient) :-
    is_malignant_oval(present).

malignancyPH2(Patient) :-
    fibre(Patient, 'N'),
    is_malignant_oval(present).

malignancyPH3(Patient) :-
    shape_Irr(Patient, present),
    is_malignant_irregular(present).


Fig. 5. Hypotheses for malignancy of non-definitive biopsies 


that combines a variable/value indicative of malignancy, the presence of
atypical ductal hyperplasia (aDH), with neutral variables such as the presence
of a mass or the grouped calcification distribution (distrib_Grp).

Predicates malignancyPH are the classifiers found for experiment (ii)
(accuracies of 94%, 95% and 86%, respectively). In all malignancyPH rules, at
least one of the probabilistic literals is present. For example, the
probabilistic literal that corresponds to a tumour of oval shape, which is
highly correlated with malignancy, appears in malignancyPH1 and malignancyPH2.
The same happens to the literal that represents the probability of malignancy
of a tumour having an irregular shape, which appears in malignancyPH3. These
results express the different mental models associated with each physician. The
rules seem to indicate that some physicians give more weight to an irregular
shape while others give more importance to an oval shape, in addition to giving
weight to fibroepithelial lesions (fibro). A useful property of these rules is
that they can be combined, perhaps producing an even better model for all
physicians.
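
To illustrate how a rule mixing crisp and probabilistic literals yields a
probabilistic prediction, the following minimal sketch evaluates the
malignancyPH2 rule of Figure 5. It assumes that the literals in a rule body are
independent, so that the body's probability is the product of the literal
probabilities. The function names, the dictionary encoding of a patient and the
value P_OVAL are hypothetical: the literature value attached to
is_malignant_oval(present) is not given in this section.

```python
def rule_probability(literal_probs):
    """Probability that a conjunctive rule body holds, assuming the
    literals are independent: the product of their probabilities."""
    prob = 1.0
    for p in literal_probs:
        prob *= p
    return prob


def malignancy_ph2(patient, p_oval):
    """Sketch of: malignancyPH2(Patient) :-
                      fibre(Patient, 'N'), is_malignant_oval(present)."""
    # Crisp literal: probability 1.0 if it holds for this patient, else 0.0.
    fibre_is_n = 1.0 if patient.get("fibre") == "N" else 0.0
    # Probabilistic literal: its annotated probability (placeholder value).
    return rule_probability([fibre_is_n, p_oval])


P_OVAL = 0.3  # hypothetical placeholder, NOT a value from the paper

print(malignancy_ph2({"fibre": "N"}, P_OVAL))  # rule fires with probability P_OVAL
print(malignancy_ph2({"fibre": "Y"}, P_OVAL))  # crisp literal fails, probability 0.0
```

Under these assumptions the crisp literal acts as a filter, while the
probabilistic literal determines the value predicted for every patient that
passes it.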

Although it was not evident beforehand, the system found that a theory of
length 1 was sufficient to best describe the datasets studied. This is
important, especially in the case of the medical dataset, since physicians
spend less time sifting through smaller rules. However, some problems require
classifiers with multiple rules, such as the motivational example of Section 2.
In this respect, SkILL takes advantage of its search strategy and its pruning
of hypothesis combinations, which allow it to explore a higher-quality portion
of the full space while performing both classification and prediction,
efficiently extending the classical ILP approach.
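
A theory with multiple rules predicts the probability that at least one of its
rules holds. The sketch below shows the inclusion/exclusion computation for two
rules and one way such a computation could support pruning. It is an
illustration under stated assumptions, not SkILL's actual criterion: the
function names are hypothetical, and when the joint probability is not supplied
the two rules are assumed independent.

```python
def theory_probability(p_h1, p_h2, p_both=None):
    """P(h1 or h2) by inclusion/exclusion.

    If p_both (the probability that both rules hold) is not given,
    independence is assumed, i.e. P(h1 and h2) = p_h1 * p_h2.
    """
    if p_both is None:
        p_both = p_h1 * p_h2
    return p_h1 + p_h2 - p_both


def can_prune(predictions, targets):
    """Hypothetical pruning test: adding a rule by disjunction can only
    raise a prediction, so if every prediction already meets or exceeds
    its target value, further combinations cannot reduce any error."""
    return all(p >= t for p, t in zip(predictions, targets))


print(theory_probability(0.5, 0.5))               # independent rules
print(theory_probability(0.5, 0.25, p_both=0.25)) # second rule implied by the first
print(can_prune([0.9, 0.8], [0.7, 0.8]))
```

The second call shows why independence is only an assumption: when one rule is
subsumed by the other, the disjunction adds nothing to the larger probability.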

5 Related Work 

The PILP setting was first introduced in EQ], where three distinct settings -
extended from traditional ILP m - are put forward: probabilistic entailment,
probabilistic interpretations, and probabilistic proofs. Later, Raedt and Thon
presented the system ProbFOIL m, which is capable of performing induction not
only over probabilistic examples, but also over background knowledge encoded as
ProbLog probabilistic facts. A number of relevant metrics such as precision,
accuracy and m-estimate are adapted from the discrete ILP domain for use in the
new setting, and ProbFOIL’s search for a hypothesis is guided by the
probabilistic accuracy of the theories. The system is demonstrated as a proof
of concept by analyzing two toy examples and extracting First Order Logic (FOL)
rules about them. However, this system does not take advantage of the
probabilistic data in order to tune its search engine: it simply extends an ILP
algorithm with a different loss function.

Probabilistic Explanation Based Learning (PEBL) [7] can find the most likely 
FOL clause which explains a set of positive examples in terms of a database 
of probabilistic facts. The explanation clause is the combination of predicates 
which yields the highest probability based on the examples, and is found by 
constructing variabilized refutation proofs for the given examples using SLD 
resolution. However, since PEBL is a deductive system, information about the 
expected structure of the explanation should be provided as predicates (which 
are often recursive). 

Orthogonally, Markov Logic Networks (MLNs) [22] also combine structure 
learning using a FOL framework with a probabilistic Markov Random Fields 
approach [Sj. An MLN is a set of pairs of logic formulae and weights, where the 
latter are calculated based on the number of true groundings of the respective 
formula. Pairs sharing at least one variable in the same grounding are connected 
by an edge, and in fact an MLN can be thought of as generating a grounded 
Markov Network for each possible set of facts. Structure learning of MLNs can 
be done by altering the logic search space by (i) adding or removing one or more 
literals from a logic formula in a pair and (ii) inverting predicate symbols of a 
formula; both techniques are similar to operations performed in traditional ILP 
structure learning. Structure learning for MLNs softens the hypotheses by using 
probabilities and as such produces better classifiers, as shown in j^; however, 
MLNs still consider crisp background knowledge, not taking into account the 
possibility of probabilistic logic facts. Additionally, and whilst MLNs are capable 
of structure learning, the final classifier is an MLN itself, which does not have 
the advantage of readability, especially when problem sizes are larger. 
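
For reference, the role played by the weights and the numbers of true
groundings can be made explicit with the standard MLN distribution from the MLN
literature. The symbols below are introduced here for illustration: x is a
possible world, w_i is the weight of the i-th formula, n_i(x) is the number of
true groundings of that formula in x, and Z is the normalizing constant.

```latex
P(X = x) = \frac{1}{Z} \exp\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z = \sum_{x'} \exp\Big( \sum_{i} w_i \, n_i(x') \Big)
```

A world that satisfies more groundings of a positively weighted formula is
therefore more probable, while the formulae themselves remain crisp.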

Finally, Meta-Interpretive Learning m - which is a technique aimed at
performing predicate invention in ILP using abduction - can also be used to
perform probabilistic structure learning by calculating prior and posterior
distributions on the hypothesis space according to the examples explained by a
given hypothesis [14]. Meta-Interpretive Learning makes it possible to use
Bayesian theory both to sample the hypothesis space and to evaluate hypotheses
according to their coverage of examples. The hypothesis search space can then
be summarized as a super-imposed logic program, where the arcs connecting atoms
contain the sum of all arcs for each individual hypothesis. As such, this
approach can simultaneously learn the structure of the arguments of the meta
rules and the parameters of the super-imposed logic program. This approach is
similar to structure learning for MLNs in the sense that a relation exists
between simultaneously grounded entities in the data and that hypotheses are
ranked according to how many of these possible configurations they explain.
However, meta-interpretive learning does not support probabilistic background
knowledge either, since it cannot represent probabilistic facts.


6 Conclusions 

This work presented the PILP learner SkILL, which extends classic ILP learners
by incorporating probabilistic facts and rules in its BK, as well as by using
probabilistic examples. Different semantics can apply to probabilistic data,
and a toy example of probabilities used as marginal distributions was presented
to motivate the use of data annotated with probabilities. Then, some details on
the setting SkILL uses were presented, focusing on the strategy to traverse the
search space, the evaluation metrics applied to hypotheses, and how to
efficiently prune the search. SkILL generates theories by combining hypotheses
using a ranking metric, always maintaining a number of random hypotheses so as
to ensure that weaker candidates are still considered. The evaluation metrics
used to select the best hypotheses and to guide the search are probabilistic
accuracy (PAcc) and root mean square error (RMSE); they differ because RMSE
penalizes larger errors more heavily. Since SkILL works on probabilistic data,
a pruning strategy based on set theory and the principle of inclusion/exclusion
was devised and implemented with the aim of increasing the system’s efficiency.
SkILL’s classification performance was assessed using a subset of the
metabolism dataset after some data were adapted to probabilities. Experiments
show that SkILL’s accuracy is better than that of the discrete ILP system
Aleph, and that the pruning strategy does not significantly alter SkILL’s final
results. Finally, SkILL was used to extract non-trivial knowledge from a
dataset of non-definitive biopsies annotated with probabilistic literature
values. Results show that rules generated from data annotated with physicians’
degrees of belief vary, but agree with medical literature values. We have been
working on the validation of these rules on a new unseen biopsy dataset.
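
The difference between the two evaluation metrics can be seen in a small
numerical sketch. It assumes that PAcc is one minus the mean absolute
difference between predicted and expected probabilities, which may differ in
detail from the definition used by SkILL; the function names and the example
values are hypothetical.

```python
import math


def pacc(predicted, expected):
    """Assumed form of probabilistic accuracy:
    1 - mean absolute difference between predictions and expected values."""
    total = sum(abs(p - e) for p, e in zip(predicted, expected))
    return 1.0 - total / len(expected)


def rmse(predicted, expected):
    """Root mean square error between predictions and expected values."""
    total = sum((p - e) ** 2 for p, e in zip(predicted, expected))
    return math.sqrt(total / len(expected))


expected = [0.5, 0.5, 0.5, 0.5]
spread = [0.75, 0.75, 0.75, 0.75]   # four small errors of 0.25
peaked = [1.0, 1.0, 0.5, 0.5]       # two large errors of 0.5

# Same total absolute error, so PAcc cannot tell the two apart ...
print(pacc(spread, expected), pacc(peaked, expected))
# ... while RMSE is higher for the predictions with the larger errors.
print(rmse(spread, expected), rmse(peaked, expected))
```

Under this assumed definition, both sets of predictions have the same PAcc,
whereas RMSE ranks the one with fewer but larger errors as worse.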